A preliminary analysis of collocational differences in monolingual comparable corpora
Abstract
1. Introduction
The notion of collocation has enjoyed mixed fortunes in the 50-odd years of its existence. Claimed to be obscure (Lyons, 1977: 612), counterproductive (Langendoen 1968: 63ff) and generally useless (Lehrer 1974) by its detractors, the idea that part of the meaning of a word is somehow related to its "word accompaniment, the other word material in which [it is] most commonly or most characteristically embedded" has also had its supporters. These have suggested that words have no meaning out of context, and meaning itself is not contained anywhere, but rather dispersed as the "light of mixed wavelengths into a spectrum" (Firth 1957(1951): 192). The intuitive appeal of this view is evident if one thinks of the difficulty a compositional approach to meaning has (even allowing for subcategorisation frames) in explaining the patterned quality of language performance, as found in a corpus, and ultimately the speaker's or writer's effortless routine handling of co(n)textual restrictions. The hypothesis that "everything we say may be in some degree idiomatic – that […] there are affinities among words that continue to reflect the attachments the words had when we learned them, within larger groups" (Bolinger, 1976:102) provides a powerful argument in favour of the empirical study of collocations, with implications for theoretical, descriptive and applied branches of linguistics. In recent years, notwithstanding the vagueness of the notion and consequent methodological problems in investigating it empirically, the study of collocations has indeed defied difficulties and criticism and sparked renewed interest in a number of areas ranging from computational and corpus linguistics to lexicography, language pedagogy, and, crucially for our purposes, translation studies.
The hypotheses that "everything we say may be in some degree idiomatic" (Bolinger, above), and that "actual usage plays a very minor role in one's consciousness of language" (Sinclair 1991:39) raise a number of interesting questions for translation research. Is there any evidence that translators are aware of collocational restrictions in the source and target languages? Do they show sensitivity to phraseological (a)typicality and restrictedness? These are very complex issues that can hardly be resolved in one fell swoop. For a start, theoretical as well as methodological problems remain as to what collocations are in the first place, and how best they can be retrieved from corpora and compared (see e.g. Krenn 2000a). Secondly, different types of corpora for …
Similar resources
Modeling bilingual word associations as connected monolingual networks
Word associations are a common tool in research on the mental lexicon. Studies report that bilinguals produce different word associations in their non-native language than monolinguals, and propose at least three mechanisms responsible for this difference: bilinguals may rely on their native associations (through translation), on collocational patterns, and on the phonological similarity betwee...
Extracting Lay Paraphrases of Specialized Expressions from Monolingual Comparable Medical Corpora
Whereas multilingual comparable corpora have been used to identify translations of words or terms, monolingual corpora can help identify paraphrases. The present work addresses paraphrases found between two different discourse types: specialized and lay texts. We therefore built comparable corpora of specialized and lay texts in order to detect equivalent lay and specialized expressions. We ide...
A Particle Swarm Optimizer to Cluster Parallel Spanish-English Short-text Corpora / Un Optimizador basado en Cúmulo de Partículas para el Agrupamiento de Textos Cortos de Colecciones Paralelas en Español-Inglés
Short-texts clustering is currently an important research area because of its applicability to web information retrieval, text summarization and text mining. These texts are often available in different languages and parallel multilingual corpora. Some previous works have demonstrated the effectiveness of a discrete Particle Swarm Optimizer algorithm, named CLUDIPSO, for clustering monolingual ...
Improved Statistical Machine Translation Using Monolingually-Derived Paraphrases
Untranslated words still constitute a major problem for Statistical Machine Translation (SMT), and current SMT systems are limited by the quantity of parallel training texts. Augmenting the training data with paraphrases generated by pivoting through other languages alleviates this problem, especially for the so-called “low density” languages. But pivoting requires additional parallel texts. We...
Collocational patterning in cross-linguistic perspective: adpositions in English, Nepali, and Russian
This paper presents a contrastive analysis of adpositions in English, Nepali and Russian corpora. Two sets of highly frequent adpositions, those with broadly locative and broadly ablative meaning, are contrasted. The ‘quantitative-distributional’ analysis is based on identifying patterns across the most statistically significant collocations of the words in question; it is undertaken using 1 mi...